Introduction

Assessment is a fearful topic to many people, even those involved in education on a regular ba- 
sis. The purpose of this document is to explore the world of assessment within the context of the No 
Child Left Behind Act of 2001. The definitions provided will help those with little knowledge of as- 
sessment to understand the essentials of practice and theory. The information is meant for assess- 
ment users so they can interpret the purpose of tests and test scores appropriately and explain them 
to others. It is not meant to provide in-depth knowledge. The document also provides background 
for other NCELA documents in the Definitions for No Child Left Behind series: Scientifically-Based 
Research, Research and Evaluation that Work within NCLB Standards, and Criteria for Evaluating 
Evidence-Based Research. 

No Child Left Behind Act of 2001 

On December 13, 2001, the 107th Congress passed the No Child Left Behind Act of 2001 
(NCLB), the latest reauthorization of the Elementary and Secondary Education Act of 1965 (ESEA); 
President George W. Bush signed the legislation in January 2002. With this legislation, Congress 
and the President encourage the use of annual assessment of all students to promote high quality 
education. Both Title I: Improving the Academic Achievement of the Disadvantaged and Title III: 
Language Instruction for Limited English Proficient and Immigrant Students include statements 
about measuring language proficiency and academic achievement using high quality assessments. 
These mandates represent an opportunity for states and districts to develop and maintain a full as- 
sessment system that meets their own needs as well as those of the federal Department of Educa- 
tion. This assessment system must include multiple, up-to-date, quality measures that encompass a 
reasonable portion of the curriculum. The assessments must be 

■ valid, reliable, and fair to all students;

■ available to students in a manner that allows them to show what they know; and

■ available in the student’s home language for at least the first year of schooling in the US.

There is no doubt that we must assess all students in order to determine their educational pro- 
gress. Assessment provides important information for accountability to teachers, administrators, 
parents, the community, and the students themselves. However, we must be careful to align the 
need for accountability with quality instruction and assessments, providing a system that is appro- 
priate for all students. When accountability includes high-stakes decisions about grade promo- 
tion/retention, placement in core content classes, or academic achievement, it is imperative that the 
system address the unique characteristics and needs of all students: students living in poverty, Eng- 
lish language learner (ELL) students, culturally diverse students, and so on. Only then can we de- 
termine the best way to assess all students' achievements and make instructional decisions. 

Assessment 

Assessment is a broad term that involves the collection and maintenance of various types of 
data about students including norm-referenced tests, criterion-referenced tests, classroom-based 
assessments of various types, and performance-based tasks. We use the term “assessment” 
throughout this document to refer to any situation in which students must respond to items or tasks 
in order to demonstrate their knowledge and/or skills in a specific area. Using the appropriate type 
of assessment for a specific purpose is important to the validity and fairness of that assessment. A 
particular assessment can be reliable, valid, and fair for one purpose, but not for another. For in- 
stance, the Iowa Test of Basic Skills (ITBS) may be valid, reliable, and fair for measuring language 
arts achievement, but not for measuring English language proficiency. 

An assessment system must include the technical standards of validity, reliability, and fairness
as well as respond to issues of bias and the interpretation and use of assessment results. Different
types of assessments can and should be used within an appropriate assessment system. Each
assessment must be considered carefully and should be related to other assessments in order to
provide a thorough picture of each individual student. Each assessment should provide the classroom
teacher, the school, and/or the school district with information about the students they are serving 
and, by implication, the teachers who are working with the students. Though it does not specifically 
mention different types of assessments, the No Child Left Behind Act of 2001 is clear in referring to 
multiple assessments (usually by mentioning “assessments” [emphasis added]). Some examples 
from Title I and Title III of NCLB include: 

■ assessments that “involve multiple up-to-date measures of student achievement, including
measures that assess higher-order thinking skills and understanding” (§1111(b)(3)(C)(vi)),

■ ensuring that high quality assessments are used, including those that are valid and reliable
(§1001(1), §1111(b)(2)(A)(i), §1111(b)(2)(D)(ii), §1111(b)(3)(C)(iii-iv), §1112(b)(1)(A), §3121(a)(3)),

■ using multiple up-to-date assessments (§1111(b)(3)(C)(vi)),

■ other academic indicators such as State or locally administered assessments
(§1111(b)(2)(C)(vii), §1116(a)(1)(B), §1111(b)(4)),

■ assessments in various languages (§1111(b)(6), §1111(b)(3)(C)(ix)(III)),

■ assessments of English language proficiency (§1111(b)(7), §3121(a)(3), §3121(d)(1-2)), and

■ assessments of various content areas (§1111(b)(3)(A), §1111(b)(3)(C)(v)(I), §1116(b)(3)(A)(ii)).



Academic Achievement 

Academic achievement refers to students’ concepts, skills, and knowledge in the core content 
areas of reading and/or language and math as well as science and history or social studies. In order 
to be successful in academic areas, students must have (1) the opportunity to learn the material and 
(2) the opportunity to demonstrate that they know the material. Academic achievement is specifi- 
cally classroom-based, taking place within schools. Several sections of both Title I and Title III of 
NCLB refer to assessing students’ achievement in core content areas; specifically, districts and 
states are responsible for 

■ ensuring that core academic subjects are assessed in manners that are appropriate for all
students (§1111(b)(3)(C)(ix) and §1116(b)(3)(A)(ii)),

■ ensuring valid and reliable assessments of academic achievement standards (§3121(d)(2) and
§3121(a)(3)), and

■ making “every effort to develop” assessments in other languages (§1111(b)(6)).

As a first layer of definition, a norm-referenced, criterion-referenced, or alternative-based as- 
sessment of academic achievement describes current levels of knowledge, attitudes, proficiencies, 
and skills. 

Norm-referenced tests 

Standardized assessments can be used to measure participants’ skills and knowledge. They are 
so named because administration, format, content, language, and scoring procedures are the same 
for all participants; these features have been “standardized.” Locally developed and commercially
available standardized assessments have been created to assess achievement in most content areas.
Items generally are multiple-choice or the newer extended multiple-choice items, which require a
student to select the correct response option and then to justify why it is correct. When considering the
definition of “standardized test,” it is clear that all high-stakes tests should be standardized to some 
extent, whether they are commercially available or locally developed. 

When referring to standardized assessments, most people think of norm-referenced tests 
(NRTs). NRTs typically are used to sort people into groups based on their assumed skills in a particu- 
lar area (for instance, those in the top 10% of skills). They are useful when selecting participants for 
a particular program because they are designed to differentiate among test-takers. In addition, NRTs 
can provide general information that will help to match classrooms for overall achievement levels 
before assigning them to a particular program. 

Publishers of nationally developed NRTs administer them to many students across the nation as 
part of the development process. These students become the norm group. The students are se- 
lected to be “typical” of students across the nation who are receiving a “conventional” curriculum 
and should represent populations with whom the test ultimately will be used. Anyone using NRTs 
should check with the publisher’s technical manual for a description of the norm group since norm 
groups provide a standard against which the local schools can compare their own students. For in- 
stance, if a school’s population includes African-Americans, Chinese-Americans, and English- 
speaking White students, then administrators at the school will want to ensure that these three eth- 
nicities/races were part of the norm group; if they were not in the norm group, then the appropriate- 
ness of the NRT for this school’s population of students may be questionable. In addition, school 
staff should ensure that the purposes of the assessment and the content of the assessment match 
the goals and content of the school’s curriculum. Typically, the match between local content and 
NRT items is less than 50%, but as districts across the country work on aligning
curricula, assessments, and standards, this is improving. When using a nationally available NRT, the
technical manual should be reviewed for this and other information. 

Criterion-referenced tests 

Criterion-referenced tests (CRTs) measure how much or whether specific knowledge has been 
gained; that knowledge is the criterion against which the participant is measured. CRTs usually con- 
tain multiple-choice items, or the newer extended multiple-choice items. Responses generally are 
marked as “correct” or “incorrect.” A score of 80% correct usually is considered as mastery of the 
knowledge being measured; a score of at least 50% correct indicates that a participant has sufficient 
understanding of the content to move on to the next level of information or topic to be learned. 
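
The two thresholds described above can be written as a simple decision rule. The following Python sketch is illustrative only: the function name and the category labels are assumptions made for this example, and the 80% and 50% cut-offs are the typical values cited above rather than fixed standards.

```python
def crt_status(items_correct: int, items_total: int) -> str:
    """Classify a CRT result using the typical 80% / 50% thresholds.

    The labels returned here are hypothetical; a district would
    substitute its own terminology and cut-offs.
    """
    if items_total <= 0:
        raise ValueError("items_total must be positive")
    percent_correct = 100.0 * items_correct / items_total
    if percent_correct >= 80:
        return "mastery"
    if percent_correct >= 50:
        return "ready for next topic"
    return "needs further instruction"


print(crt_status(8, 10))   # 80% correct
print(crt_status(5, 10))   # 50% correct
print(crt_status(4, 10))   # 40% correct
```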

CRTs must be aligned closely to the curriculum (which of course must be aligned closely to the 
district or state content standards) in order to ensure that what is being tested is what has been 
taught. CRTs should be used before the content is taught, then repeated after the content is taught, 
thus ensuring that students’ knowledge is based on what was taught; i.e., that they did not know the 
content before instruction. 

Attempts have been made to create assessments that could be used in both norm-referenced 
and criterion-referenced manners. These attempts generally fail because the purposes of the two 
types of assessment are so different. 

Alternative assessments 

Alternative assessments are types of measures that fit a contextualized measurement approach, 
meaning that they can be incorporated easily into classroom routines and learning activities. Their 
results are indicative of the participant’s performance on the skill or subject of interest. As used 
within this document, “alternative assessment” subsumes authentic assessment, performance- 
based assessment, informal assessment, ecological assessment, curriculum-based assessment, 
and other similar forms that actively involve the participant. Some example activities and products 
that can be used as alternative assessments are listed in Exhibit 1. 



Exhibit 1: Example activities that can be used as alternative assessments 



■ Essays and reports
■ Poetry and creative writing
■ Story retelling
■ Journal entries and logs
■ Collaborative work
■ Homework
■ Posters, artistic media
■ Reading lists
■ Games
■ Brainstorming
■ Writing samples
■ Debates, presentations
■ Observations
■ Anecdotal records
■ Peer reviews
■ Questionnaires
■ Cloze tests
■ Miscue analysis
■ CRTs
■ Teacher and student checklists





Alternative assessments often are referred to as measuring whether the student can think be- 
cause they generally do not ask the student to identify a correct answer, but rather to consider the 
information they have, modify their knowledge, and then apply it to a specific problem. In this way, 
alternative assessments tap higher order thinking skills more frequently than do multiple choice 
NRTs. 

Note: Alternative assessment should not be confused with alternate assessment. The former 
can be used with any population for whom we want to show progress; it is an alternative to formal,
standardized testing. The latter is a type of assessment used with populations who cannot complete 
the content or format of assessments used with mainstream populations. 



Language proficiency 

Language proficiency definitions vary by state but generally refer to both productive (speaking, 
writing) and receptive (reading, listening) skills, as first recommended by the Council of Chief State 
School Officers (1992). Assessing language proficiency is a difficult issue. Language proficiency as- 
sessments 

■ must be appropriate for students of different cultural, ethnic, social, and educational
backgrounds;

■ are assumed to be able to predict how well a student will do in academic classes although they
do not include information about cognitive abilities or academic achievement; and

■ tend to measure specific aspects of language (e.g., word choice, grammar) rather than overall
communicative competence, which comprises a repertoire of communication skills that can be used in
a variety of situations.

Students can acquire English through multiple sources such as school, playground, church, tele- 
vision and radio, as well as neighborhood children. It is important that language proficiency assess- 
ments measure not only a conversational level of English, but also the academic English necessary 
to function on grade level in all-English-language classrooms. 



Both Title I and Title III refer to assessing students’ language proficiency. Specifically, 

■ schools, districts, and states are responsible for ensuring valid and reliable assessments of
English proficiency (§3121(a)(3) and (d)(1)) and

■ Local Education Agencies must “...provide for an annual assessment of English proficiency
(measuring students' oral language, reading, and writing skill in English) of all [ELL] students”
(§1111(b)(7)).

Testing the four modes 

There was a time when students' skills in speaking English were the sole measure of their ability 
to participate in an English-speaking, mainstream classroom. In 1992, the Council of Chief State 
School Officers formally introduced the concept of assessing all four modes of language proficiency, 
listening, speaking, reading, and writing, in order to ensure that students’ English language capabili- 
ties would allow them to participate fully in the English-speaking, mainstream classroom. Some lan- 
guage proficiency instruments measure only one mode (e.g., the Student Oral Language Observation 
Matrix [SOLOM]) while others measure all four modes (e.g., the Language Assessment Scales [LAS] if 
using both the LAS-Oral and the LAS-Reading/Writing or the IDEA Proficiency Test [IPT]). Each as- 
sessment instrument is somewhat different and must be reviewed carefully to determine whether its 
cost, time, administration, and purpose meet the needs of the district. More importantly, language 
arts achievement tests (e.g., the SAT-9 or TerraNova) cannot be used to assess language proficiency; 
language arts achievement is not language proficiency. 

Comprehension 

NCLB states that the four modes of English proficiency must be assessed as well as “compre- 
hension.” At the present time, it is not clear exactly what is meant by “comprehension.” The follow- 
ing definitions have been suggested and should be considered by those selecting language 
proficiency assessments for students. 



1. A student whose average score on the assessment of listening and reading is above a “lim- 
ited English proficient” level is demonstrating “comprehension.” 

2. A student who understands what is happening in class, follows directions, and generally 
seems to grasp the overall pattern of the class, is demonstrating “comprehension.” 

3. A student whose language proficiency testing indicates that s/he is proficient in English is, 
by definition, demonstrating “comprehension.” 

4. A student whose average score on the assessment of all four modes is above a “limited 
English proficient” level is demonstrating “comprehension.” 

5. A student who can respond to basic questions (e.g., name, age, description of family mem- 
bers) prior to the administration of a language proficiency test is demonstrating “compre- 
hension.” 

6. A student who can follow the instructions for an assessment, and understand its purpose 
and importance, is demonstrating “comprehension.” 

It does not appear that a specific test of comprehension is required, so schools, districts, and states 
should work together to define “comprehension” in a manner that meets their expectations of stu- 
dents in English-speaking classrooms. 



Scoring mechanisms 

Scoring an assessment is nearly as important as ensuring that the assessment is valid, reliable, 
and fair. NRTs, CRTs, and alternative assessments can be scored in several different ways, but the 
scores are only as helpful as they are understandable and useful. The interpretation of scores can 
be confusing and can lead to erroneous conclusions about students’ performances. Some of the 
more often-used scoring mechanisms are defined briefly below. 

Scores for NRTs 

Raw scores provide the number of items answered correctly. These numbers can be manipu- 
lated mathematically to give an average correct score for a classroom or a grade level. An average 
that includes other grade levels is only possible, however, if all students took exactly the same test; 
an average across different tests is not appropriate since the tests will have different numbers of 
items, cover different content areas, and have different levels of difficulty. 

Grade equivalents or grade placement scores indicate how well a student is doing relative to 
other students in the same grade. They are stated in tenths of a school year (assuming 10 months 
in a school year), so 7.3 indicates the third month of seventh grade. These scores are extrapolated; 
they only estimate the relationship between grade levels and test scores with the assumption that 
students gain knowledge in a predictable upward fashion. Test publishers generally do not test stu- 
dents with a given form of the assessment during every month of each year of school so, again, we 
do not know exactly how students score during each month of a given grade level, or how students 
from different grade levels would fare on the test.

These scores are frequently misinterpreted because they are based on the tenuous assumptions 
that 

(1) what is being tested is studied by students consistently from one year to the next, 

(2) a student’s increase in competence is essentially constant across the years, and 

(3) tests reasonably sample what is being taught at all of the grade levels for which scores are 
being reported. 

In addition, grade equivalents cannot be summed, averaged, or combined in any way because they 
are not true numeric scores but represent grade levels - a category rather than a score. They are 
best used to indicate that students are performing at or near grade-level expectations, above expec- 
tations, or below expectations - with no further interpretation or estimate of how much above/below 
expectations the student scored. 

Stanines provide a rough approximation of an individual's performance relative to the perform- 
ance of other students. Originating from the term “standard nine,” stanines divide the range of 
scores on a test into nine equal groupings. The score of 1 stanine represents the lowest of the nine 
groups and a 9 represents the highest scoring group. Because of the general nature of stanines, 
these gross descriptors may communicate individual test results, but using stanines to report group 
data misrepresents the precision of the data-gathering instruments and forms. As an example, the 
first, lowest, stanine may include raw scores from 1 to 30. This broad grouping of scores is not only 
an imprecise method of reporting, but also will make it difficult to show increases - a student may 
move from a raw score of 2 correct to a raw score of 28 correct, but remain in the lowest stanine. 
Stanines cannot be used for describing the average achievement level of a class or a group of stu- 
dents; they cannot be averaged, summed, or combined in any way because of the number of raw 
scores subsumed in each stanine. Exhibit 2 is a graph that indicates the relationship between
several scoring mechanisms, including stanines, and the normal curve.
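
The coarseness of stanines can be seen in a short Python sketch. It assumes the conventional stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of the norm group); these cut points are a common convention and are not taken from this document or from any particular publisher's manual. The function and constant names are illustrative.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9, assuming the
# conventional 4-7-12-17-20-17-12-7-4 percent split of the norm group.
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile_rank: int) -> int:
    """Return the stanine (1-9) that contains a percentile rank (1-99)."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    return bisect_right(STANINE_LOWER_BOUNDS, percentile_rank) + 1


# A student who moves from the 11th to the 22nd percentile has improved,
# but the stanine does not change.
print(stanine_from_percentile(11), stanine_from_percentile(22))
```

Under these assumed cut points, both percentile ranks fall in the third stanine, which is why stanines are poorly suited to showing growth.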



Exhibit 2: Scores on a normal curve 



Relation Between Stanines, NCEs, and PRs

Percentiles are used frequently, and are misinterpreted frequently. They range in value from 1 to
99, indicating the percentage of students scoring at or lower than the test score in question. For
example, a student scoring at the 70th percentile scored equal to or better than 70% of the students
who took the test; s/he scored higher than 69% of the others. Percentiles are designed to match a
normal curve, so 50% of students should score above the 50th percentile and 50% should score
below the 50th percentile. Percentiles cannot be averaged, summed, or combined in any way because
they are not equal-sized units. As demonstrated in Exhibit 2, percentile units at each end of the normal
curve (e.g., percentiles 1-5) are much wider than those at the middle of the normal curve (e.g., compare
percentiles 1-5 with percentiles 40-50).

Percentiles should be used only to describe how a student is scoring in relationship to her/his 
peers. An additional problem with percentiles is that they have no relationship to the actual score 
achieved. That is, a student may score 29 correct out of 100 items, but if everyone does poorly on 
the assessment, this low actual score still could put the student at the 90th percentile - looking
falsely high. 
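
The point that a percentile says nothing about the actual score can be illustrated with a minimal Python sketch. It uses the definition given above (the percentage of students scoring at or below a given score); the function name and the class scores are hypothetical, and publishers' norming procedures are more elaborate than this.

```python
def percentile_rank(score, all_scores):
    """Percentage of test-takers scoring at or below `score`, reported as 1-99."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    at_or_below = sum(1 for s in all_scores if s <= score)
    rank = round(100 * at_or_below / len(all_scores))
    return max(1, min(99, rank))  # percentiles are reported from 1 to 99


# Hypothetical raw scores (out of 100 items) on a test everyone found difficult.
class_scores = [5, 8, 10, 12, 15, 18, 20, 22, 29, 35]

# A raw score of 29 out of 100 is low, yet nine of the ten students
# scored at or below it.
print(percentile_rank(29, class_scores))  # 90
```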

NRTs often report the percentage of students scoring above the 25th, 50th, and/or 75th
percentile. In such cases, it may be possible to show some growth by reporting these numbers. As an
example, an evaluation may indicate that “in the 2001-2002 school year, 30% of the 4th graders
scored above the 50th percentile but in the 2002-2003 school year, 50% of this same group (now 5th
graders) scored above the 50th percentile, indicating a growth in achievement.”



There is a whole set of scoring types referred to as standard scores. All standard scores are re- 
calculated raw scores. As standard scores, each has a predetermined average and a predetermined 
standard deviation (a measure of how much the scores vary within the group of students taking the 
assessment - a small standard deviation indicates that the group scored similarly while a large 
standard deviation indicates that the group’s scores were very heterogeneous). A particular type of 
standard score is the normal curve equivalent (NCE). NCEs have a national average of 50 and a 
standard deviation of 21.06; they range in value from 1 to 99 and match percentiles and the normal 
curve at 1, 50, and 99 (see Exhibit 2). NCEs can be mathematically manipulated (added, averaged, 
and combined) because the scores are equal-sized units (see Exhibit 2). NCEs do allow careful com- 
parisons across students, across years, across content areas, and, to some extent, across tests from 
different publishers. An NCE score below 20-25 on a multiple choice test with 4-5 response options
for each item is referred to as a “chance” score; that is, students could achieve this score without
looking at the items on the assessment, but just randomly picking one of the response options. 
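
The relationship between percentiles and NCEs can be sketched in Python. This example assumes the usual construction of an NCE as a normalized standard score (mean 50, standard deviation 21.06, as stated above) obtained from the percentile's position on the normal curve. The function names are illustrative, and a publisher's conversion table should be used for actual reporting.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # standard deviation stated in the text

_standard_normal = NormalDist()  # mean 0, standard deviation 1


def nce_from_percentile(percentile_rank: float) -> float:
    """Convert a percentile rank (1-99) to a normal curve equivalent."""
    z = _standard_normal.inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z


def percentile_from_nce(nce: float) -> float:
    """Convert an NCE back to a percentile rank."""
    return 100 * _standard_normal.cdf((nce - NCE_MEAN) / NCE_SD)


# NCEs and percentiles coincide at 1, 50, and 99.
for pr in (1, 50, 99):
    print(pr, round(nce_from_percentile(pr)))

# Averaging must be done on NCEs, not percentiles: the mean NCE of students
# at the 1st and 50th percentiles does not correspond to the 25.5th percentile.
mean_nce = (nce_from_percentile(1) + nce_from_percentile(50)) / 2
print(round(percentile_from_nce(mean_nce)))
```

Because NCE units are equal-sized, the average is computed on the NCE scale and only then, if needed, converted back to a percentile for reporting.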

Another frequently used standard score is the scaled score. Various test publishers have cre- 
ated their own unique scales that cannot be described in great detail here. Suffice it to say that 
these are appropriate scores for use in an evaluation, but care should be taken when comparing the 
scaled score of one test to the scaled score of another test. 

Finally, gain scores are used to show how much students have progressed since a previous test- 
ing period. The usual method for calculating gains is to subtract a previous score from the current 
score, or a pretest from a posttest. This is problematic because no single assessment is perfectly 
valid, reliable, and fair. When gain scores are created, all of the technical problems in both the first 
testing and the second testing are contained in the single gain score, thus making it, in essence, 
doubly unreliable and potentially invalid and unfair. As an additional problem, the gain score may 
indicate progress, but does little to indicate how well the student actually is scoring (i.e., what does it 
mean to indicate that the average student gained 4 points?). 
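
One way to see why a gain score is less dependable than either score alone is to combine the measurement error of the two testings. The Python sketch below assumes the classical result that, when the errors of the pretest and posttest are independent, the standard error of the difference is the square root of the sum of the squared standard errors. The scores and standard errors shown are hypothetical.

```python
import math


def gain_with_error(pretest, posttest, sem_pretest, sem_posttest):
    """Return the gain score and its standard error of measurement.

    Assumes the measurement errors of the two testings are independent.
    """
    gain = posttest - pretest
    sem_gain = math.sqrt(sem_pretest ** 2 + sem_posttest ** 2)
    return gain, sem_gain


# Hypothetical example: each testing has a standard error of 3 points.
gain, sem_gain = gain_with_error(pretest=30, posttest=34,
                                 sem_pretest=3.0, sem_posttest=3.0)
print(f"gain = {gain}, standard error of gain = {sem_gain:.2f}")
```

In this hypothetical case the 4-point gain is smaller than its own standard error, so it could not be distinguished from measurement error.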

Scores for CRTs 

Raw scores can be used to show mastery of an area (e.g., 8 of 10 items answered correctly indi- 
cates mastery). A more useful score often is the percentage correct, which provides more informa- 
tion, with a standard understanding of what the score means. Percent correct, when used to 
measure mastery, can be used within an evaluation to compare students on the same assessment, 
but cannot be used to compare scores across assessments (e.g., a score on a math achievement 
test cannot be compared to a score on a language arts achievement test) because the content and 
difficulty level are different. 

Other scoring types that were described in the section on NRT scores can be used with CRTs. 
However, this would make it difficult to compare a student’s score against a specific criterion, which 
usually is the purpose of a CRT. 

Scores for alternative assessments 

For many types of alternative assessments, different scoring methods can be used. Three typi- 
cally used scoring techniques are 

■ Holistic scoring, which provides a general, overall score for a piece of work - usually with a range 
from 0-3 up to 0-10 points; 

■ Primary trait scoring, which defines particular features (or traits) of a performance and then
provides separate scores for each trait (e.g., spelling, grammar, voice, sentence structure, and
content are some of the traits of writing), usually each scored with values from 0 to 4 or 5; and



■ Analytic scoring, which assigns a weight based on the importance of each trait (e.g., sentence 
structure may be twice as important as spelling). 

Exhibit 3 provides example uses for each of these scoring rubrics. 

Exhibit 3: Using rubrics 

Students have completed a writing sample. The instructor can score the writing sample using holistic, pri- 
mary trait, or analytic methods, depending on his/her purpose. 

Holistic score: the writing sample will be scored on a basis of an overall score of 1-5 (a zero indicates no 
response at all). Descriptors need to be created for at least the 1-, 3-, and 5-point scores. 

1 = limited evidence of achievement, writing skills below grade level

3 = adequate achievement, a general ability to write at an acceptable level for this grade level

5 = excellent achievement, writing is above what would be expected for this grade level

These scores would be applied to the writing sample as a whole.

Primary trait score: the writing sample is reviewed for specific elements (or traits) of writing. Each element 
is scored separately and, as the class focuses on different elements across the school year, what is scored 
can be changed. The scores could be the same 1-5 (again with 0 indicating no response at all) described 
above. The teacher might then score grammar, word choice, spelling, sentence structure, ideas/content, 
and conventions (e.g., indent each paragraph, capitalize first letter in sentence). A student then would
receive a score for each trait, or a total score ranging from 6 to 30.

Analytic score: The scoring of the writing sample builds on primary trait scoring. The same rubric (1-5) can 
be used, and the same traits can be scored. However, in analytic scoring each trait is given a value. For
instance, spelling may be considered less important, with a value (or weight) of 1, with grammar of
moderate importance with a value of 2, and ideas of great importance with a value of 3. The scoring for a
particular child then might be as in the chart below.



Trait                 Score   Value/weight   Total score
Spelling              3       1              3
Grammar               2       2              4
Word choice           4       2              8
Sentence structure    4       2              8
Ideas/content         5       3              15
Conventions           4       1              4
Total score for writing sample               42



Scores should be maintained separately for each trait (in order to determine progress in each area) and 
can be summed to provide an overall score. 
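
The arithmetic behind the analytic score in Exhibit 3 can be reproduced with a brief Python sketch. The trait scores and weights are those shown in the exhibit; the variable and function names are illustrative.

```python
# Trait scores (1-5 rubric) and weights taken from the Exhibit 3 example.
trait_scores = {
    "spelling": 3,
    "grammar": 2,
    "word choice": 4,
    "sentence structure": 4,
    "ideas/content": 5,
    "conventions": 4,
}
trait_weights = {
    "spelling": 1,
    "grammar": 2,
    "word choice": 2,
    "sentence structure": 2,
    "ideas/content": 3,
    "conventions": 1,
}


def primary_trait_total(scores):
    """Unweighted sum of the separate trait scores."""
    return sum(scores.values())


def analytic_total(scores, weights):
    """Weighted sum: each trait score multiplied by its weight."""
    return sum(scores[trait] * weights[trait] for trait in scores)


print(primary_trait_total(trait_scores))            # 22
print(analytic_total(trait_scores, trait_weights))  # 42
```

Keeping the per-trait scores in their own structure, rather than storing only the total, preserves the information needed to track progress in each area.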



Scores for language proficiency assessments 

Many language proficiency assessments offer a variety of scoring mechanisms; they can be 
scored using NCEs, percentiles, raw scores, and so on. Most typically, however, raw scores are con- 
verted to a categorical score, or a level of proficiency such as “fluent English speaker,” or “compe- 
tent English writer.” There are two problems with using categorical scores: 

1. Many raw scores “fit” in each category, making it difficult to see smaller increments of 
growth; and 

2. Using cut-off scores to create categories means that a student’s language proficiency can be 
under- or over-represented based on one item missed or one item guessed correctly. 




Raw scores should be collected and maintained for evaluative purposes, but categories can be used 
to describe the language proficiency of groups of students. 
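
The second problem with categorical scores, sensitivity to a single item near a cut-off, can be shown in a short Python sketch. The cut-off scores and two of the three category labels below are hypothetical and are not drawn from any published language proficiency assessment.

```python
from bisect import bisect_right

# Hypothetical cut-off scores and category labels, for illustration only.
CUT_SCORES = [40, 60]
LEVELS = [
    "non-English speaker",
    "limited English speaker",
    "fluent English speaker",
]


def proficiency_level(raw_score: int) -> str:
    """Map a raw score to a proficiency category using the cut-off scores."""
    return LEVELS[bisect_right(CUT_SCORES, raw_score)]


# One item separates these two students, yet they are placed in different
# categories.
print(proficiency_level(59), "|", proficiency_level(60))

# An 18-point gain leaves this student in the same category.
print(proficiency_level(41), "|", proficiency_level(59))
```

This is why the raw scores should be retained alongside the category for evaluation purposes.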

Using and reporting test scores 

Many test scores are meaningless when presented without supporting information. As an exam- 
ple, stating that “the average score on a test was 35” or “Joel’s score was 40” has little impact. 
However, adding that “on a test with a possible score of 0-50, these students’ scores ranged from 
28 to 45 with an average of 35 and Nancy scored 42” gives a great deal more information. This type 
of information should be provided whenever reporting scores. 
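
The supporting information recommended above (possible score range, observed range, and average) can be assembled routinely. The following Python sketch is one way to do so; the function name, the wording of the output, and the class scores are assumptions made for this example.

```python
def describe_scores(scores, min_possible, max_possible):
    """Summarize a set of test scores with the context needed to interpret them."""
    if not scores:
        raise ValueError("scores must not be empty")
    average = sum(scores) / len(scores)
    return (
        f"On a test with a possible score of {min_possible}-{max_possible}, "
        f"scores ranged from {min(scores)} to {max(scores)} "
        f"with an average of {average:.1f} (n = {len(scores)})."
    )


# Hypothetical class scores on a 50-point test.
print(describe_scores([28, 31, 35, 36, 45], min_possible=0, max_possible=50))
```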

An additional problem in interpreting scores may be the type of score presented. However, it is 
possible to change the type of test score that is reported. That is, if test scores have been recorded 
in students’ files as percentiles, it is possible to change these to more usable scores such as NCEs. 
Most test manuals will provide a conversion table that includes typically reported scores, such as raw 
scores, grade equivalents, percentiles, NCEs, stanines, and so on. The table provides equivalencies 
among the scores. Exhibit 4 provides an example conversion table for scoring mechanisms. By 
reading across information in Exhibit 4 that describes student performance in the first month of 
fourth grade (i.e., grade 4.1), scores can be transformed from a raw score of 11 to a stanine of 3, an 
NCE of 30, and a percentile of 17. The final columns indicate that a score of 11 is a grade equivalent 
(GE) of 2.7 and an extended scale score (ESS) of 430. Note that because the raw test scores range
from 1 to 45, and percentiles and NCEs range from 1 to 99, some percentile and NCE scores are not
on the table (e.g., the jump from 8 NCEs to 13 NCEs or from the 33rd percentile to the 39th
percentile).
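
When no conversion table is at hand, the general normal-curve relationships among percentiles, NCEs, and stanines can be approximated as in the Python sketch below. It assumes the usual definitions (NCEs with a mean of 50 and a standard deviation of 21.06; stanines covering 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of the norm group) and the function names are our own. A publisher's table is built from the actual norming data and should take precedence over these approximations; grade equivalents and extended scale scores cannot be derived this way.

```python
from bisect import bisect_right
from statistics import NormalDist

NCE_SD = 21.06   # NCE scale: mean 50, standard deviation 21.06
# Lowest percentile rank that falls in stanines 2 through 9.
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_nce(percentile):
    """Convert a percentile rank (1-99) to the nearest whole NCE."""
    z = NormalDist().inv_cdf(percentile / 100)
    return round(50 + NCE_SD * z)

def nce_to_percentile(nce):
    """Convert an NCE (1-99) to the nearest whole percentile rank."""
    return round(100 * NormalDist().cdf((nce - 50) / NCE_SD))

def percentile_to_stanine(percentile):
    """Convert a percentile rank (1-99) to a stanine (1-9)."""
    return bisect_right(STANINE_LOWER_BOUNDS, percentile) + 1

# The grade 4.1 example from the text: percentile 17, NCE 30, stanine 3.
print(percentile_to_nce(17), nce_to_percentile(30), percentile_to_stanine(17))
```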



Exhibit 4: Example conversion table for scoring mechanisms

Raw      Grade 4 (NCE/PR)                    Grade 5      All grades
score    Oct. 4.1    Feb. 4.5    May 4.8     (NCE/PR)     GE      ESS
10       27/14       22/9        19/7        19/7         2.5     424
11       30/17       26/13       23/10       21/8         2.7     430
12       33/21       29/16       26/13       24/11        2.8     437
13       36/25       32/20       29/16       27/14        3.1     443
14       39/30       35/24       32/20       29/16        3.3     449
15       41/33       37/27       34/22       31/18        3.5     454
16       44/39       39/30       36/25       33/21        3.7     459
17       46/42       41/33       38/28       35/24        3.8     464
...
35       85/95       80/92       76/89       71/84        7.3     551
36       87/96       82/94       79/92       73/86        7.5     557

Note: Only the rows that remain legible are shown; raw scores in the full table run from 1 to 45.
The full table also gives stanines and separate Grade 5 columns for October (5.1), February (5.4),
and May (5.8). Only one Grade 5 NCE/PR pair per row is legible, and its month is unclear.






It is clear from these brief descriptions that even some of the more common scoring techniques
are not particularly useful for evaluation purposes. Exhibit 5 lists the scoring types discussed above
and indicates whether they can be used to describe general performance and/or in computations
for an evaluation. In several cases, comparisons can be made to the norm group, the large number
of students who took the assessment at the behest of the test publisher. These students form the
“standard”; it is their scores that establish percentiles, NCEs, and so on. If a program hopes that
its students will “look like,” or have scores similar to, the national average on a particular NRT,
then the norm group is an appropriate comparison.



Exhibit 5: Test scores and their uses

                                                                               Scores can be used for
Type of score                     Score compares students against              Evaluation   Description
Raw scores                        Nothing, there is no comparison              Yes          Yes
Percent correct                   Standard of 100% correct                     Yes          Yes
Grade equivalents                 Norm group                                   No           Perhaps
Standard scores, including NCEs   Norm group                                   Yes          Yes
Stanines                          Norm group                                   Not suggested for any purpose
Percentiles                       Norm group                                   No           Yes
Gain scores                       Their own previous score                     No           Perhaps
Mastery scores                    Criterion for acceptable skills, knowledge   Yes          Yes
Categories or levels              Norm group considered in each level          No           Yes



Finally, a comment on cut-off scores. Cut-off scores typically are used to categorize scores into
groupings of similar students (e.g., high, medium, and low readers or students who have attained
below basic proficiency, basic proficiency, proficiency, or advanced proficiency levels). These scores
can harm students when they are rigidly enforced and used to make life decisions. For instance,
consider the recent growth in testing to ensure that students have met minimum criteria to
graduate from high school. In most cases, a single test is the final determination of graduation
status: those who pass graduate, while those who do not pass either may have the opportunity to
take the test again or do not officially graduate. We suggest the following procedures when
creating cut-off scores:

1. Consider carefully whether a cut-off score truly is needed; 

2. Spend time with consultants, researchers, and staff to determine a cut-off score that is ap- 
propriate; 

3. Create a “gray area” for borderline students - e.g., if 60% correct is the cut-off score, then 
scores between 58% correct and 62% correct are in the “gray area;” 

4. Determine further assessments or reviews of past accomplishments to determine whether 
students in the “gray area” pass or fail the assessment; and 

5. Communicate these guidelines clearly to students so that they understand the options. 
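
The “gray area” in step 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the cut-off of 60% and the two-point margin come from the example above, and we assume the boundaries (58% and 62%) are themselves inside the gray area.

```python
def classify(percent_correct, cutoff=60.0, margin=2.0):
    """Return 'pass', 'fail', or 'gray area' for a percent-correct score."""
    # Scores within the margin on either side of the cut-off are borderline.
    if abs(percent_correct - cutoff) <= margin:
        return "gray area"
    return "pass" if percent_correct > cutoff else "fail"

for score in (55, 58, 61, 62, 70):
    print(score, classify(score))
```

Students classified as “gray area” would then be referred to the further assessments or reviews described in step 4 rather than being passed or failed on the single score.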
Technical qualities of assessments

Meaningful assessment is essential. To ensure that an assessment is meaningful, three factors 
must be considered: reliability, validity, and fairness. While psychometricians still argue about the 
relative importance of each of these concepts and what constitutes “good” reliability, validity, and 
fairness, some general explanatory statements can help to clarify these assessment qualities. There 
also are several references within Title I and Title III of NCLB to assessments that are 

■ “high-quality” (§1001(1), §1111(b)(3)(A), and §1112(b)(1)(A)),

■ aligned to state content and achievement standards (§1111(b)(2)(A)),

■ improved (§3115(c)(2)(A)), and

■ valid and reliable (§1111(b)(2)(D)(i), §1111(b)(3)(C)(ix)(III), and §3121(a)(3)(B)).



Reliability 

Reliability is the stability or consistency of an assessment. For instance, two assessments of a 
student, performed at the same time, should show similar results; two reviews of a teacher's 
qualifications should result in similar conclusions. An instrument must be reliable if it is to be used 
to make decisions about how well a participant is performing or how well a staff development 
program is succeeding. As a general rule, the more items on an assessment, the greater the 
reliability. An assessment with 50 items will be more reliable than an assessment with 10 items; 
however, an assessment with 300 items may fatigue the test-takers and be very unreliable. Most 
psychometricians agree that at least 10 items are needed for each area tested (i.e., the various 
subareas in a language arts achievement assessment should each have at least 10 unique and 
separate items) in order to have a reliable instrument. 
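
The link between test length and reliability is often estimated with the Spearman-Brown formula, which the text above does not name; we introduce it here only as an illustration. The sketch below assumes the added items are comparable in quality to the existing ones, and it does not model the fatigue effect described above, so it will overstate the reliability of very long assessments. The starting reliability of .60 is a hypothetical value.

```python
def spearman_brown(reliability, length_factor):
    """Project reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical 10-item assessment with reliability .60, lengthened to 50 items.
projected = spearman_brown(0.60, 50 / 10)
print(round(projected, 2))   # prints 0.88
```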

Reliability is measured on a scale from 0.0 to 1.0, with higher numbers being better (i.e., more 
reliable) - although it is virtually impossible to achieve a rating of 0 or 1. Most psychometricians 
agree that a reliability coefficient of at least 

■ .80 is needed if a test will be used to make decisions about a single individual;

■ .65 is needed if a test will be used to make decisions about a group of individuals such as a
classroom; and

■ .50 is needed if a test will be used to provide some general information about how well a group
of individuals is performing.
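
These rules of thumb can be written as a simple check. The Python sketch below uses the three thresholds listed above; the function name and the wording of the returned descriptions are our own.

```python
def supported_uses(reliability):
    """List the uses a reliability coefficient supports under the rules of thumb above."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0.0 and 1.0")
    uses = []
    if reliability >= 0.50:
        uses.append("general information about a group")
    if reliability >= 0.65:
        uses.append("decisions about a group such as a classroom")
    if reliability >= 0.80:
        uses.append("decisions about a single individual")
    return uses

print(supported_uses(0.70))
```

A coefficient below .50 returns an empty list, meaning the assessment does not meet even the most lenient of the three thresholds.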

Validity 

Validity is more difficult to describe, in part because psychometricians are changing their own 
views of validity. The newer view is that validity asks whether the interpretation, uses, and actions 
based on assessment results are appropriate. It is especially important to consider the communica- 
tive competence of learners when creating a valid test. In addition, the specific purpose of the as- 
sessment must be considered. An assessment may be valid for one purpose, but not for another. 
Basic questions when considering validity are “Does this test measure what it purports to meas- 
ure?”, “Do I believe what this test tells me about my learners?”, and “Are the results of this assess- 
ment similar to results from other assessments of the same topic?” 



Fairness 

Fairness refers to testing that considers the language, gender, culture, and overall abilities of the 
test-takers. For instance, if it is known that a group of test-takers has difficulty writing in English,
then a fair test will include response options that allow the students to create pictures or graphs to
show their answers or to dictate answers to a person who is fluent in English. Fairness is
affected by how items are developed, the scoring procedures used (as well as the training of scorers
and the calibration of scores), access to good instruction, and so on.

Fairness also should ensure that biases are not evident in the testing procedures or test items. 
Biases generally fall into three areas: 

■ biases in item development or scoring procedures that unjustly promote or oppose an individ- 
ual’s race/ethnicity, culture, language, or beliefs; 

■ stereotyping within items or reading passages based on race or ethnicity, language, culture, or
physical ability by under- or over-representing or ridiculing certain groups; and

■ illustrations that negate the impact of certain individuals, typically by not including them. 

Biases can be quite subtle. For instance, if items on an assessment only use names that are typi- 
cally associated with Anglos, there is a relatively subtle bias for one group (Anglos) and against oth- 
ers (e.g., Asians, Hispanics) who choose to maintain culturally-appropriate names. These issues can 
impact students’ interest in subject matter as well as their interest in achieving on an assessment. 



Conclusions: Making assessment meaningful and useful 

There is no doubt that we must assess all students in order to determine their educational pro- 
gress. Assessment provides important information for accountability to teachers, administrators, 
parents, the community, and the students themselves. However, we must be careful to align the 
need for accountability with quality instruction and an assessment system that is appropriate for all 
students. When accountability includes high-stakes decisions about grade promotion/retention, 
placement in core content classes, or academic achievement, it is imperative that the system ad- 
dress the unique characteristics and needs of all students: students living in poverty, English lan- 
guage learner students, culturally diverse students, high achievers, and so on. Only then can we 
determine the best way to assess all students’ achievements and make instructional decisions. 

Assessment leads to accountability by informing various stakeholders about student progress. In 
order to ensure that accountability is meaningful, an assessment system should include NRTs, CRTs, 
and alternative assessments. In addition, there must be a systematic process of identification,
placement, continuous monitoring of progress, transition from ESL or dual language support into
English-only classrooms (if appropriate), and inclusion of all students in the full assessment system.
Thus we must align the need for accountability and quality instruction with an assessment system 
that is appropriate for all students. 

NCLB states that scores of assessments of both academic subjects and language proficiency
shall be used by the Local and State education agency “for improvement of programs and activities;
to determine the effectiveness of programs and activities in assisting [ELL] students to attain English
proficiency and meet challenging State academic content and student academic achievement
standards...” (§3121(b)(1)(2)). Thus we must always ensure that

■ the assessments used with all students are of the best and highest quality possible;

■ multiple assessments are used, especially when making major life decisions;

■ scores are maintained across time so that progress can be followed carefully; and

■ interpretations of scores are made with wisdom and understanding.

In order to follow these mandates, we suggest that the following elements are essential within 
any assessment system used for accountability purposes: 

■ use multiple assessments of different types (e.g., an NRT, a CRT, and an alternative
assessment);

■ ensure that all assessments are reliable, valid, and fair;

■ create a policy that indicates when students should be tested in what language(s);

■ assure that staff development activities, curricula, expected teaching techniques, and
assessments are aligned;

■ do not use one assessment, of any type, to make a life-decision (e.g., program placement,
graduation);

■ provide annual training sessions for those who administer and score assessments;

■ maintain long-term data for each student (that is, keep scores from past years as well as this
year’s scores);

■ when using an NRT, read the technical manual to determine the ethno-linguistic groups who
participated in the norming process;

■ review each assessment - items, paragraphs to be read, response options, and scoring
techniques - for biases and stereotyping;

■ maintain data in the raw form, or at least in as detailed a form as possible (i.e., do not keep
categorical information; it is easy to create categories from the raw data, but not vice versa);
and

■ review assessments often to ensure that they continue to meet the needs and policies of the
local school district.



It may be helpful to create a team or advisory panel that helps make assessment decisions. 
Such a panel should include administrators, teachers, paraprofessionals, parents, and community 
members; at upper grade levels, students may be included as well. This group could be given the
mandate to review

■ the alignment of curricula, instruction, and assessment;

■ the selection and/or development of assessment(s);

■ new assessments for bias; and

■ cut-off scores for specific purposes,

and, more generally, to ensure that assessments are used in an appropriate manner.

This document is fairly short for such a difficult and complex topic. For those who wish to learn
more about any of these aspects of assessment, a reference list and bibliography follow. Some of
these books are new; others are “classics” in the field. In addition, most colleges and universities
offer classes in “tests and measures” or “psychometrics” in schools of education and/or psychology.